{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "6.3. 语言模型数据集（周杰伦专辑歌词）"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "本节将介绍如何预处理一个语言模型数据集，并将其转换成字符级循环神经网络所需要的输入格式。为此，我们收集了周杰伦从第一张专辑《Jay》到第十张专辑《跨时代》中的歌词，并在后面几节里应用循环神经网络来训练一个语言模型。当模型训练好后，我们就可以用这个模型来创作歌词。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "6.3.1. 读取数据集"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'想要有直升机\\n想要和你飞到宇宙去\\n想要和你融化在一起\\n融化在宇宙里\\n我每天每天每'"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import tensorflow as tf\n",
    "import numpy as np\n",
    "import random\n",
    "import zipfile\n",
    "with zipfile.ZipFile('../data/jaychou_lyrics.txt.zip') as zin:\n",
    "    with zin.open('jaychou_lyrics.txt') as f:\n",
    "        corpus_chars = f.read().decode('utf-8')\n",
    "corpus_chars[:40]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这个数据集有6万多个字符。为了打印方便，我们把换行符替换成空格，然后仅使用前1万个字符来训练模型。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "corpus_chars = corpus_chars.replace('\\n', ' ').replace('\\r', ' ')\n",
    "corpus_chars = corpus_chars[0:10000]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "6.3.2. 建立字符索引"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1027"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "idx_to_char = list(set(corpus_chars))\n",
    "char_to_idx = dict([(char, i) for i, char in enumerate(idx_to_char)])\n",
    "vocab_size = len(char_to_idx)\n",
    "vocab_size"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "之后，将训练数据集中每个字符转化为索引，并打印前20个字符及其对应的索引。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "chars: 想要有直升机 想要和你飞到宇宙去 想要和\n",
      "indices: [828, 670, 812, 135, 846, 94, 244, 828, 670, 139, 982, 854, 246, 997, 119, 314, 244, 828, 670, 139]\n"
     ]
    }
   ],
   "source": [
    "corpus_indices = [char_to_idx[char] for char in corpus_chars]\n",
    "sample = corpus_indices[:20]\n",
    "print('chars:', ''.join([idx_to_char[idx] for idx in sample]))\n",
    "print('indices:', sample)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "6.3.3. 时序数据的采样"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在训练中我们需要每次随机读取小批量样本和标签。与之前章节的实验数据不同的是，时序数据的一个样本通常包含连续的字符。假设时间步数为5，样本序列为5个字符，即“想”“要”“有”“直”“升”。该样本的标签序列为这些字符分别在训练集中的下一个字符，即“要”“有”“直”“升”“机”。我们有两种方式对时序数据进行采样，分别是随机采样和相邻采样。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "6.3.3.1. 随机采样\n",
    "下面的代码每次从数据里随机采样一个小批量。其中批量大小batch_size指每个小批量的样本数，num_steps为每个样本所包含的时间步数。 在随机采样中，每个样本是原始序列上任意截取的一段序列。相邻的两个随机小批量在原始序列上的位置不一定相毗邻。因此，我们无法用一个小批量最终时间步的隐藏状态来初始化下一个小批量的隐藏状态。在训练模型时，每次随机采样前都需要重新初始化隐藏状态。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# 本函数已保存在d2lzh包中方便以后使用\n",
    "def data_iter_random(corpus_indices, batch_size, num_steps, ctx=None):\n",
    "    # 减1是因为输出的索引是相应输入的索引加1\n",
    "    num_examples = (len(corpus_indices) - 1) // num_steps\n",
    "    epoch_size = num_examples // batch_size\n",
    "    example_indices = list(range(num_examples))\n",
    "    random.shuffle(example_indices)\n",
    "\n",
    "    # 返回从pos开始的长为num_steps的序列\n",
    "    def _data(pos):\n",
    "        return corpus_indices[pos: pos + num_steps]\n",
    "\n",
    "    for i in range(epoch_size):\n",
    "        # 每次读取batch_size个随机样本\n",
    "        i = i * batch_size\n",
    "        batch_indices = example_indices[i: i + batch_size]\n",
    "        X = [_data(j * num_steps) for j in batch_indices]\n",
    "        Y = [_data(j * num_steps + 1) for j in batch_indices]\n",
    "        yield np.array(X, ctx), np.array(Y, ctx)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "让我们输入一个从0到29的连续整数的人工序列。设批量大小和时间步数分别为2和6。打印随机采样每次读取的小批量样本的输入X和标签Y。可见，相邻的两个随机小批量在原始序列上的位置不一定相毗邻。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "X:  [[ 6  7  8  9 10 11]\n",
      " [12 13 14 15 16 17]] \n",
      "Y: [[ 7  8  9 10 11 12]\n",
      " [13 14 15 16 17 18]] \n",
      "\n",
      "X:  [[18 19 20 21 22 23]\n",
      " [ 0  1  2  3  4  5]] \n",
      "Y: [[19 20 21 22 23 24]\n",
      " [ 1  2  3  4  5  6]] \n",
      "\n"
     ]
    }
   ],
   "source": [
    "my_seq = list(range(30))\n",
    "for X, Y in data_iter_random(my_seq, batch_size=2, num_steps=6):\n",
    "    print('X: ', X, '\\nY:', Y, '\\n')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "6.3.3.2. 相邻采样"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# 本函数已保存在d2lzh包中方便以后使用\n",
    "def data_iter_consecutive(corpus_indices, batch_size, num_steps, ctx=None):\n",
    "    corpus_indices = np.array(corpus_indices)\n",
    "    data_len = len(corpus_indices)\n",
    "    batch_len = data_len // batch_size\n",
    "    indices = corpus_indices[0: batch_size*batch_len].reshape((\n",
    "        batch_size, batch_len))\n",
    "    epoch_size = (batch_len - 1) // num_steps\n",
    "    for i in range(epoch_size):\n",
    "        i = i * num_steps\n",
    "        X = indices[:, i: i + num_steps]\n",
    "        Y = indices[:, i + 1: i + num_steps + 1]\n",
    "        yield X, Y"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "X:  [[ 0  1  2  3  4  5]\n",
      " [15 16 17 18 19 20]] \n",
      "Y: [[ 1  2  3  4  5  6]\n",
      " [16 17 18 19 20 21]] \n",
      "\n",
      "X:  [[ 6  7  8  9 10 11]\n",
      " [21 22 23 24 25 26]] \n",
      "Y: [[ 7  8  9 10 11 12]\n",
      " [22 23 24 25 26 27]] \n",
      "\n"
     ]
    }
   ],
   "source": [
    "for X, Y in data_iter_consecutive(my_seq, batch_size=2, num_steps=6):\n",
    "    print('X: ', X, '\\nY:', Y, '\\n')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "6.3.4. 小结"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "时序数据采样方式包括随机采样和相邻采样。使用这两种方式的循环神经网络训练在实现上略有不同。"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
